DOCUMENT RESUME



ED 091 427                                        TM 003 641

AUTHOR         Stegman, Charles E.
TITLE          Subjective Probability and the Administration of
               Objective Tests.
SPONS AGENCY   Pittsburgh Univ., Pa. School of Education.
PUB DATE       [73]
NOTE           37p.; Paper presented at the Annual Convocation of
               the Northeastern Educational Research Association
               (4th, Ellenville, New York, October 31-November 2,
               1973)

EDRS PRICE     MF-$0.75 HC-$1.85 PLUS POSTAGE
DESCRIPTORS    Annotated Bibliographies; *Confidence Testing;
               Guessing (Tests); *Measurement Techniques; Multiple
               Choice Tests; *Objective Tests; Research Needs;
               Scoring; Testing; Test Reliability; Test Validity



ABSTRACT 

Probabilistic testing involves having the examinee 
assign probabilities to each of the options of a multiple-choice 
item. These probabilities reflect the student's perception of the 
correctness of each option. What is presented in the paper is a 
rationale for probability testing, the current theoretical and 
empirical findings, and some suggested directions for further 
research. The rationale given for considering probabilistic testing 
includes the following points. First, testing involves making 
decisions under uncertainty as do many situations faced every day and 
as such should be solved by using a subjective probability decision
theoretic paradigm. Second, using multiple-choice testing situations 
may be a good way of teaching the subjective probability decision 
theoretic paradigm. Third, probability testing procedures should lead 
to more reliable and possibly more valid tests. Fourth, probability 
testing in conjunction with specific utility functions yields a way 
of incorporating and handling "risk" and "guessing" behavior in 
testing situations. An annotated bibliography is also included to 
introduce potential researchers to the general area of confidence 
testing. (Author) 



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL
INSTITUTE OF EDUCATION POSITION OR POLICY.



Subjective Probability and the Administration
of Objective Tests

Charles E. Stegman
University of Pittsburgh



Paper Presented at the 4th Annual Convocation of the
Northeastern Educational Research Association
Fallsview Hotel, Ellenville, New York



October 31 - November 2, 1973



Subjective Probability and the Administration
of Objective Tests¹

Charles Stegman
University of Pittsburgh

Introduction

The widespread use of objective tests began about forty years ago. Two
persistent concerns of measurement specialists regarding objective tests since
then have been the development of methods for controlling guessing behavior
and of taking into account partial knowledge. Failure to take into account
guessing behavior and partial knowledge, in the usual 1-0 scoring rule for
correct-incorrect responses, has led many to conclude that objective item
scores result in a rather crude approximation of a person's actual position on
the continuum of the variable being measured.

It may be argued, of course, that this concern is misplaced. If one assumes
a homogeneous item set, where the probability p_i of a correct response by
person i remains constant across all k items in the set, then one should be
concerned with p_i k, and not individual item scores, which cannot equal p_i
except when p_i is 1 or 0.

While the above is theoretically true, it is also true that decisions are
made on the basis of either item scores or small subsets of item scores, that
is, subtests or scales of test batteries. The trend toward criterion-referenced
measurement indicates that more, rather than less, emphasis will be
placed on the evaluation of item responses, where these responses are assumed
to represent a sample of behavior(s) from some domain. The homogeneity
characteristic is a thorny problem, since homogeneity can be only partially
attained. Thus far, the complexity of cognitive processes has kept ahead of the
itemwriter's attempts at developing the statistically and psychologically
homogeneous item set.

¹This research was supported in part by a grant from the Faculty Research
Fund, School of Education, University of Pittsburgh.

Dissatisfaction with both the conventional methods of administering and
scoring objective tests and with the methods advanced to compensate for the
various deficiencies has led several measurement specialists to suggest
alternative methods for administering and scoring objective tests. These
proposed methods have had the common objective of improved precision in the
form of greater reliability and validity.

The proposed methods include confidence-weighting (Hevner, 1932; Soderquist,
1935; Ebel, 1965a, 1965b), option-elimination (Coombs, 1953; Coombs et al.,
1956), and probabilistic testing (de Finetti, 1965; Shuford et al., 1966). In
confidence-weighting the examinee selects the perceived correct option to a
multiple-choice question, then indicates his certainty of its correctness on an
accompanying confidence scale. Item scores are dependent upon these two
factors, accuracy and expressed confidence. Option-elimination requires the
student to identify as many of the n-1 distractors as possible from the set of
n options for the multiple-choice item. Item scores are a function of the
number of correct identifications with a penalty for misidentifying the correct
answer as a distractor. Probabilistic testing involves having the examinee
assign probabilities to each of the n options of a multiple-choice item. These
probabilities reflect the student's perception of the correctness of each
option.
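
As an illustration of the option-elimination idea described above, the
following Python sketch scores a single item. The paper does not state the
weights, so the ones used here are an assumption made only for illustration:
one point for each distractor eliminated, and a penalty of n-1 points if the
correct answer is eliminated. The function name and argument names are
hypothetical.

```python
def option_elimination_score(eliminated, correct, n_options):
    """Score one item under an option-elimination rule.

    Assumed weights (not taken from the paper): +1 for every distractor
    the examinee eliminates, and -(n_options - 1) if the examinee
    eliminates the correct answer.
    """
    eliminated = set(eliminated)
    for option in eliminated | {correct}:
        if option < 0 or option >= n_options:
            raise ValueError("option index out of range")
    # Credit for each distractor correctly identified.
    score = sum(1 for option in eliminated if option != correct)
    # Penalty for misidentifying the correct answer as a distractor.
    if correct in eliminated:
        score -= n_options - 1
    return score


# Four-option item whose correct answer is option 0.
print(option_elimination_score({1, 2, 3}, correct=0, n_options=4))
print(option_elimination_score({1}, correct=0, n_options=4))
print(option_elimination_score({0, 1}, correct=0, n_options=4))
```

Under these assumed weights an examinee with partial knowledge (one
distractor eliminated) earns partial credit, which is the feature that
distinguishes the method from 1-0 scoring.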

In comparing the above three methods it can be argued that confidence-
weighting and option-elimination are approximations to probabilistic testing.
Confidence-weighting is simply a partial version of the last approach since only
one option (the perceived correct answer) is weighted. Also, the confidence
weight assigned is usually limited to only four values, while in probabilistic
testing the probability weight can be essentially any number between zero and
one. Option-elimination approximates probabilistic testing since the student
implicitly weights the options and then dichotomizes the options into two sets:
(1) perceived distractors and (2) one or more options thought to contain the
right answer. There is no attempt at explicitly measuring the weights attached
to the elements of set (1) or, more importantly, set (2) when it contains more
than one option. De Finetti (1965), in discussing option elimination, derives
formulae for the threshold values necessary for the individual to eliminate a
given option. That is, one can work backwards from the options eliminated to
set bounds on the probability of "correctness" associated with them.

Echternacht (1972) discussed these three methods under the general heading
of confidence testing. The purpose of his paper is "to describe the various
forms of confidence testing as they have been developed and to provide a brief
evaluation of these forms" (p. 217). His paper presents a good review of
literature and overview of the area of "confidence testing."

The present study will limit itself to Echternacht's subcategory "probability
testing" which is associated with the personal probability approach of
de Finetti. What will be attempted here is to present the rationale for
probability testing, to identify the theoretical and empirical findings, and to
suggest some directions for further research. It is assumed that the reader is
basically familiar with what "probability testing" is, at least at the level of
Echternacht (1972), and Lord and Novick (1968, pp. 314-323).

Rationale

Before considering the "measurement" rationale for probability testing it
is important to note that de Finetti in 1965 was attempting to apply a
philosophy of rational decision making under uncertainty to some problems
associated with objective testing. This philosophy, which is dependent on
subjective probability, is intended to apply to all situations in life involving
decision making under uncertainty, and taking "objective tests" is only one such
situation. A basic postulate of this philosophy is that:

     We are always living and dealing in conditions of uncertainty.
     If probabilistic thinking is to be the guide in
     facing uncertainty, it is essential that we learn how to
     do it [correctly]. To know the rules of probability and
     to be acquainted with their practical application is to
     free us from the danger of inconsistency. (de Finetti,
     1970, p. 38)

This postulate is certainly not limited only to subjective probabilists.
It has been a basic tenet of psychology for at least thirty years. Egon
Brunswik was one of the first psychologists to argue for considering the
probabilistic nature of life in designing psychological experiments. In 1943 he
stated his position as follows:

     On the whole, only scattered recognition has been given
     to the fact that object-cue and means-end relationships
     do not hold with the certainty obtained in the nomothetic
     study of the so-called laws of nature, but are rather of
     the character of probability relationships. This
     deficiency is more clearly reflected in the psychology of
     learning which has proceeded almost exclusively along a
     dialectically dichotomized all-or-none pattern of "correct
     vs. incorrect," "right vs. wrong." Situations in
     which food can be found always to the right and never to
     the left, or always behind a black door and never behind
     a white one, are not representative of the structure of
     the environment... They are thus not sound as experimental
     devices from the standpoint of a psychology which
     wishes to learn, above all other things, something about
     behavior under conditions representative of actual life...
     I have expanded on this subject to such an extent because
     I believe that the probability character of the
     causal (partial cause-and-effect) relationships in the
     environment calls for a fundamental, all-inclusive shift
     in our methodological ideology regarding psychology.
     (p. 260-261)

Further, when confronted with uncertainty

     All a finite, sub-divine individual can do when acting
     is--to use a term of Reichenbach--to make a posit, or
     wager. The best he can do is to compromise between cues so
     that his "posit" approaches the "best bet" on the basis of
     all the probabilities, or past relative frequencies, or
     relevant interrelationships lumped together. (p. 259)

In a similar manner Hilgard (1951) says:

     A great many perceptual experiences can be understood by
     considering the perceiving person to be a statistical
     machine capable of quickly estimating probabilities. That
     is, each of the cues present now is related to many past
     experiences. Past experiences provide a kind of table
     of probabilities according to which estimates are made,
     but the perceiver has to make use at once of the
     experience tables corresponding to each of the cues, some of
     which will point in one direction, some in another.
     (p. 111-112)

Recently the psychologist David Bakan (1967) has attempted "to gain an
understanding of the nature of the learning process through the examination of
one particular formulation of the nature of the scientific method, the
principle of inverse probability" (p. 58).

To see that the theorists of subjective probability intend to develop
procedures applicable to uncertainty as encountered in everyday life one need
only consider some of their basic writings. Harold Jeffreys in the preface to
the first edition of his book Theory of Probability (1961) says: "The chief
object of this work is to provide a method of drawing inferences from
observational data that will be self-consistent and can also be used in practice"
(p. ix). Also that "the fundamental problem of scientific progress, and a
fundamental one of everyday life, is that of learning from experience" (p. 1).
I. J. Good in his book Probability and the Weighing of Evidence (1950) states
that "the aim of the present work is to provide a consistent theory of
probability that is mathematically simple, logically sound and adequate as a basis
for scientific induction, for statistics, and for ordinary reasoning" (p. 2).
Elsewhere, Good (1965) says that: "The difficulties become clear when it is
realized that we estimate probabilities every minute of the day, at least
implicitly, and that how we do this is unknown" (p. iv) and "Nevertheless, for
purposes of making decisions we do manage to approximate estimates of
probabilities. How this is done is an interesting problem in psychology and
neurophysiology" (p. 4).

To quote one other source, Alberoni (196?) argues that probabilistic
thinking "comes into play every time a man finds himself faced with uncertainty
and he must take decisions and a stand with respect to the future while basing
himself on uncertain or incomplete knowledge" (p. 285). It is this
characteristic of answering multiple-choice questions that led de Finetti to
propose the use of alternative measurement procedures. That is, in the usual
testing situation the student has to make a decision and take a stand [select
one option as correct] with respect to the future [his selection will be graded
correct or incorrect] and this decision may be made on uncertain or incomplete
information [he knows he does not know the correct answer but is "fairly
confident" about the correctness of some of the options]. For the person who is
"certain" as to the right answer, the problem of uncertainty does not exist and
his best response is to indicate that option. De Finetti's (1965) paper is
normative in that he is "not interested here in the actual behavior as it
results from the habits or other psychological tendencies of different persons,
but in analyzing what response is most advantageous in the face of the
uncertainty of any given situation" (p. 87).

Winkler and Murphy (1968) in discussing several uses of probability scoring
rules for evaluating meteorologists allude to two other reasons for using
probabilistic testing. These are (1) to help people become "better" assessors
and (2) to evaluate people in a substantive area. The first reason is closely
related to de Finetti's philosophy. That is, the multiple-choice testing
situation may be a very good situation for teaching people the fundamentals of
probabilistic decision making, i.e., the ability to accurately specify
probabilities which reflect the person's subjective beliefs. The application of
probability testing is confounded if you do not have "good" probability
assessors in the normative sense of possessing some expertise in probability
assessment.

The second reason is the one that most people in educational measurement would
consider most important. Can probability testing be used to evaluate people
in a substantive context, and if so, do the procedures yield test scores which
are more reliable and/or valid than conventional testing procedures? If the
procedures do not increase reliability and validity then some will argue why
bother to expend the additional time and money to use them. To quote Lord and
Novick (1968):

     Thus, at present, the sole recommendation of these new
     methods is their strong conceptual attractiveness. In
     evaluating any new response method, it will be necessary
     to show that it adds more relevant ability variation
     to the system than error variation, and that any
     such relative increase in information retrieved is
     worth the effort... (p. 314)

     As with all other mental test theories, validity of
     this theory must be established by using it to make
     and verify important predictions. If the theory of
     personal probability in application to the assessment
     of partial knowledge suggests certain measurement
     procedures and related item-scoring and item-weighting
     formulas that are then empirically established to be
     valid predictors, then this theory will have been
     validated for this particular purpose. (p. 323)

Coombs (1953), Coombs et al. (1956) and Shuford, Albert and Massengill
(1966) all argue that differential choice of distractors allows an examinee to
exhibit partial information and that this should produce greater item and test
variance but should reduce error variance.

Another reason for using a decision-theoretic approach is that by using
the concept of a utility function it is possible to specify the types of
situations where a student can "rationally" be a risk taker or where he should
be "honest" in reporting his true beliefs (Roby, Rippey, 1971; Murphy and
Winkler, 1970). The problems associated with guessing and risk taking are
not peculiar to probabilistic testing, and if it is argued that in a given
testing situation these are important considerations then probabilistic testing
yields a conceptual and mathematical format for including them.

To summarize, we have listed the following reasons for considering
probabilistic testing. First, testing involves making decisions under
uncertainty as do many situations faced every day and as such should be solved
by using a "subjective probability decision theoretic" paradigm. Second, using
multiple-choice testing situations may be a good way of teaching the subjective
probability decision theoretic paradigm. Third, probability testing procedures
should lead to more reliable and possibly more valid tests. Fourth, probability
testing in conjunction with specific utility functions yields a way of
incorporating and handling "risk" and "guessing" behavior in testing situations.

Theoretical and Empirical Findings

This section will attempt to elaborate on the summarization and critique
provided by Echternacht (1972). To avoid duplication it is again assumed that
the reader is familiar with his discussion (pp. 223-235). Some references not
cited by Echternacht will also be considered and an annotated bibliography is
included as Appendix A.

Echternacht (1972, p. 22^) lists six preliminary assumptions underlying
probability testing, while Lord and Novick (1968, p. 319) list essentially the
same assumptions but distinguish only three assumptions. Since most of the
results noted in this section refer to these general assumptions it is
worthwhile to quote Lord and Novick's assumptions.



     1. The scoring method, as well as the permitted modes of
        responding, must be known to the subjects. Furthermore
        subjects must not only know the method but learn
        to understand fully its implications with particular
        reference to behavior in the face of uncertainty.
        Finally they must be able to make the necessary
        computations to determine an optimal strategy for each item.

     2. Examinees must be keenly interested in obtaining a high
        total score, precisely in the sense of maximizing their
        total expected score.

     3. They must be able to assign numerical values to their
        subjective probabilities accurately and reliably.

Since in probability testing the examinee is required to specify through his
subjective probabilities his degree of belief concerning the various options,
the solutions to problems associated with the quantification of these beliefs
and their evaluation is central to implementing probability testing.

Van Naerssen (1961) in discussing the measurement of subjective probability
was one of the first to note that if the candidates are not informed about the
scoring method then the score will depend on the "accidentally chosen strategy"
of the candidate. He argues that by telling how many points they can get with
each probability rating and explaining that the aim is to get as many points as
possible, "a stronger anchoring of the rating categories will be obtained and
also a more impartial experiment in which the selectors (are able to) know how
they stand" (p. 151). Van Naerssen derives two of the basic scoring rules
(logarithmic and quadratic) used in probability testing. Toda (1963) also
experimented with these two scoring schemes. Van Naerssen also points out that
in deducing these rules it is assumed that the utility of the score is a linear
function of the score itself and the effects of non-linearity still need to be
determined. Roby (1965) notes that this difficulty is encountered because the
person's expressions of his internal belief are influenced by the person's
interpretation of where the "payoff lies." Roby shows a possible solution lies
in rewarding the person in "direct proportion to the validity of his belief"
and that if this is done then the maximum expected value for a person's score
occurs when the person "bets" or responds with his true beliefs. Roby developed
a scoring rule for rewarding people which is called the "spherical" scoring
system.
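
The claim that the expected score is largest when the person responds with
his true beliefs can be checked numerically. The Python sketch below is only
an illustration under stated assumptions: it takes the spherical rule in its
commonly used form (the probability reported for the correct option divided
by the Euclidean length of the report), assumes linear utility, and uses a
hypothetical belief vector. The function names are illustrative and do not
come from Roby's paper.

```python
import math


def spherical_score(report, correct):
    """Reported probability of the correct option divided by the
    Euclidean length of the whole report (assumed form of the rule)."""
    norm = math.sqrt(sum(r * r for r in report))
    return report[correct] / norm


def expected_score(report, belief):
    """Expected score when option j is correct with probability belief[j]."""
    return sum(b * spherical_score(report, j) for j, b in enumerate(belief))


def best_report_on_grid(belief, step=0.05):
    """Try every three-option report on a grid and return the best one."""
    n = round(1 / step)
    best, best_value = None, -math.inf
    for a in range(n + 1):
        for b in range(n + 1 - a):
            report = (a * step, b * step, (n - a - b) * step)
            value = expected_score(report, belief)
            if value > best_value:
                best, best_value = report, value
    return best


belief = (0.6, 0.3, 0.1)  # hypothetical true beliefs of one examinee
best = best_report_on_grid(belief)
print([round(r, 2) for r in best])
```

The grid search should return the belief vector itself, so no report other
than the honest one gives this examinee a higher expected score.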

Shuford, Albert and Massengill (1966) show that the quadratic, spherical,
and logarithmic scoring rules possess the property that an examinee can
maximize his expected score on a test (assuming a linear utility function)
if, and only if, he honestly reflects his personal probabilities, that is,
when the examinee's expressed probabilities correspond to his true
probabilities. They also point out that with the quadratic and spherical scoring
rules the score for any item is determined by the probability assigned to the
correct answer and the way in which the student's uncertainty is distributed
over the other options. For instance, if (a) is the correct answer to a
three-option test item then the two responses (.4, .4, .2) and (.4, .3, .3)
would yield different scores. However, the scoring rules are "symmetric" in the
sense that (.4, .2, .4) would yield the same item score as (.4, .4, .2). On the
other hand the logarithmic scoring rule is a function only of the probability
assigned to the correct answer. They conclude their arguments for using
probabilistic testing by saying:

     In considering the substitution of admissible probability
     measurement procedures for the choice procedures in current
     use, it is important to realize that no information
     will be lost through the substitution since a student's
     choices can be reconstructed from knowledge of his
     probabilities and his utility structure with respect to the
     testing situation. However, the development of appropriate
     psychometrics and test theory would greatly facilitate
     the exploitation of the additional information made
     available through the use of admissible probability
     measurement procedures. (p. 14^])
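
The three-option example given by Shuford, Albert and Massengill can be
worked through directly. The sketch below assumes commonly used forms of the
three rules (quadratic: twice the probability on the correct option minus the
sum of squared probabilities; spherical: the probability on the correct option
divided by the root of the sum of squares; logarithmic: the log of the
probability on the correct option). The constants and scaling in the original
paper may differ, so the numbers are illustrative only.

```python
import math


def quadratic_score(report, correct):
    # Assumed form: 2 * p_correct - sum of squared probabilities.
    return 2 * report[correct] - sum(r * r for r in report)


def spherical_score(report, correct):
    # Assumed form: p_correct / sqrt(sum of squared probabilities).
    return report[correct] / math.sqrt(sum(r * r for r in report))


def logarithmic_score(report, correct):
    # Depends only on the probability given to the correct option.
    return math.log(report[correct])


rules = {
    "quadratic": quadratic_score,
    "spherical": spherical_score,
    "logarithmic": logarithmic_score,
}
responses = [(0.4, 0.4, 0.2), (0.4, 0.3, 0.3), (0.4, 0.2, 0.4)]

# Option (a), index 0, is the correct answer for every response.
for name, rule in rules.items():
    print(name, [round(rule(r, 0), 4) for r in responses])
```

With these forms the first two responses receive different quadratic and
spherical scores, the first and third receive the same score, and all three
receive the same logarithmic score, which matches the properties described in
the text.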

Winkler (1967a, 1967b) also discusses some problems associated with the
quantification of judgment. In his 1967b paper Winkler first notes the
distinction between a "good" assessor with respect to the personalistic theory
of probability and a "good" assessor who is knowledgeable in the area under
consideration. The first context deals with expertise in the general area of
probability assessment, while the second context deals with expertise in some
area of application. In using probability testing in educational measurement we
are trying to reward the most knowledgeable in the second context assuming that
the examinee has learned to be a "good" assessor in the first context. These
two contexts are identified respectively as the "normative" and "substantive"
by Winkler and Murphy (1968). In the normative sense a good probability
assessor is one who obeys certain postulates of coherence (consistency) and who
expresses probability assessments which correspond to his subjective beliefs or
judgments. The actual quantification can be accomplished through using
interrogation and bets or through using scoring rules or "penalty functions"
which oblige the person, under linear utility, to express his true
probabilities. It is the latter that are used in probability testing. Winkler
doesn't argue that everyone is necessarily a "good" assessor in the normative
sense but he does argue that people can be trained to be "good" assessors; he
expects people to learn from experience. Training and experience should
increase a person's understanding of the methods and fewer inconsistencies
should be observed. Training and experience should also lead to a more reliable
specification of subjective beliefs into probabilities. That is, naive
assessors tend to respond in certain idiosyncratic manners, i.e., they use such
numbers as 0, .25, .50, .75 and 1.0 too often or in testing they weight only
one or two options.

By comparing the assessments and the actual values observed Winkler argues
that the person can use this information to learn to be a "better" assessor in
the second context as well. Such information would be useful to evaluate a
person's "bias," i.e., a tendency to consistently underestimate or to
consistently overestimate with respect to certain probabilities and situations.
Shuford and Massengill (1970) present a way of evaluating such bias for people
using their SCoRule.




The assumption of a linear utility function and risk-taking and risk-avoiding
are also raised by Winkler. He noted the problems associated with a non-linear
utility but did not present a solution in this paper. He does point out that if
risk-taking or risk-avoiding behavior persists over time then the person is not
following the postulates of coherence or he is operating under some other
utility function.
Winkler (1969) points out since probability assessments must be made before
the actual outcome is known, then no matter which scoring rule is used, the
assessor should maximize his expected score or expected utility. Any of Shuford
et al.'s "proper" scoring rules can be used in this regard to evaluate assessors
in the normative sense. However, the evaluation of assessors in the substantive
sense occurs after the outcome is observed. Winkler proves the logarithmic
scoring rule is the only one that is compatible with both types of assessments.
He also showed that it is possible to relax the assumption of a linear utility
function provided you know the form of the non-linear utility function. That
is, corresponding to any utility function U and scoring rule S which is "proper"
under a linear utility function, it is possible to find a scoring rule which
is also "proper" under U. This point is extended further in Winkler and Murphy
(1970) and Murphy and Winkler (1970). Also important in the latter article is
an introduction to sensitivity analysis of scoring rules. That is, how
sensitive, in the sense of the scores assigned, are the scoring rules to
deviations from optimum assessment of probabilities. The more sensitive the
scoring rule the more it "punishes" an assessor as he deviates from reporting
his true probabilities. For three values of the true probability they show that
in general the logarithmic rule is less sensitive than the quadratic, which in
turn is less sensitive than the spherical, although for small deviations all
three rules are fairly insensitive. Although they don't mention it, this may be
a plus in favor of these scoring rules when used in probability testing. One
objection sometimes given to probability testing is that unless the examinee
is an expert in probability assessment you may introduce more error variance
through its use than you eliminate. What sensitivity analysis might show is that
one does not have to assume expertise in probability assessment before using
probability testing for "substantive" evaluation.
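
A rough idea of what a sensitivity analysis involves can be given with a short
Python sketch. It is not a reproduction of the Murphy and Winkler analysis: it
assumes a two-outcome event, common unscaled forms of the three rules, linear
utility, and a hypothetical true probability of .7. Because the rules are left
in their raw units here, the losses should be compared across deviations
within a rule, not across rules.

```python
import math


def _p_outcome(r, occurred):
    """Probability the report r placed on the outcome that happened."""
    return r if occurred else 1.0 - r


def quadratic(r, occurred):
    return 2 * _p_outcome(r, occurred) - (r * r + (1 - r) ** 2)


def spherical(r, occurred):
    return _p_outcome(r, occurred) / math.sqrt(r * r + (1 - r) ** 2)


def logarithmic(r, occurred):
    return math.log(_p_outcome(r, occurred))


def expected_score(rule, r, p):
    """Expected score of report r when the event has true probability p."""
    return p * rule(r, True) + (1 - p) * rule(r, False)


def expected_loss(rule, r, p):
    """Expected score given up by reporting r instead of the true p."""
    return expected_score(rule, p, p) - expected_score(rule, r, p)


p = 0.7  # hypothetical true probability
for rule in (logarithmic, quadratic, spherical):
    losses = [round(expected_loss(rule, p + d, p), 4) for d in (0.0, 0.05, 0.1, 0.2)]
    print(rule.__name__, losses)
```

For each rule the loss is zero at the honest report and grows slowly at first
as the report moves away from it, which is the sense in which small deviations
are only lightly penalized.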

Stael von Holstein (1970a) notes that the practical uses of scoring rules
as feedback devices have been restricted to the areas of meteorology and
educational testing. Probability assessment experiments have also been
performed in the areas of football (de Finetti, 1962; Winkler, 1967c), stock
market prices (Stael von Holstein, 1969) and weather forecasts (Stael von
Holstein, 1970b). These experiments all used a quadratic scoring rule and show
that it is feasible to obtain probability assessments for non-dichotomous
situations. It was not clear from all the experiments whether the subjects in
fact became better assessors in the normative sense during the course of the
experiments. This was also found in a testing situation by Hansen (1971). In
addition Hansen found statistically significant correlations between a measure
of degree of certainty in the examinee's responses and the scores on the F scale
and Kogan and Wallach risk taking measures. It should be noted that although
the correlations were significant they were also relatively low (-.211 to
-.411) with most of them below -.250. Hansen used the spherical scoring rule
and obtained split-half test reliabilities of .781 and .766 for his two tests.

Phillips (1970) argues that probability judgments can be affected to varying
degrees by memory and cognitive processes, prior experience and information,
social and cultural norms, personality, and cognitive style. He concludes that
to the extent we agree on these variables they should be the focus of future
research "since effective training can be designed only when we know how these
factors influence the naive person's judgments" (p. 234).




Some other empirical studies done in educational testing are Michael (1968),
Rippey (1968, 1971) and Hambleton, Roberts and Traub (1970). Michael
(1968) used the scoring rule S = r, where r is the probability assigned to the
correct answer. The probabilities expressed were also restricted to simple
tenths. Although she found higher reliabilities and lower standard errors it
must be noted that this scoring rule is not "admissible" in that it requires the
person not to express his true probabilities when trying to maximize his
expected score under linear utility (see Winkler, 1967b, p. 1111).
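
Why a rule of this kind is not admissible can be seen with a small numerical
check. The sketch assumes the rule is the simple linear one, in which the item
score equals the probability placed on the correct answer, and it uses a
hypothetical belief vector and linear utility. The function names are
illustrative.

```python
def linear_score(report, correct):
    """Assumed linear rule: the score is the probability reported
    for the option that turns out to be correct."""
    return report[correct]


def expected_score(report, belief):
    """Expected score when option j is correct with probability belief[j]."""
    return sum(b * linear_score(report, j) for j, b in enumerate(belief))


belief = (0.6, 0.3, 0.1)      # hypothetical true beliefs
honest = expected_score(belief, belief)
all_on_favorite = expected_score((1.0, 0.0, 0.0), belief)

print(round(honest, 2), round(all_on_favorite, 2))
```

The examinee expects a higher score by putting all of the probability on the
option he thinks most likely than by reporting his beliefs honestly, so the
rule rewards overstatement of confidence.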

Hambleton, Roberts and Traub (1970) compared probability testing using a
logarithmic scoring rule [possible probabilities were 0, .05, .10, ..., .95, 1.00]
with conventional testing and differential weighting. They found probability
testing yielded the highest validity (correlation of midterm with final) of
the methods (.720) and the lowest split-half reliability (.655). For the
conventional test the validity and reliability were .621 and .710 respectively.
Two other points of interest in this study are the introduction of an answer graph
for reporting probability and mention of the fact that the difficulty of the test
will affect the application of probability testing. For instance, the test they
used was "easy" for the students involved: in the group using probability testing,
77% of the time they indicated a probability of 1.00. In this situation
assessing partial knowledge may not be a great concern.
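
For readers unfamiliar with the reliability figures quoted above, the following
is a minimal sketch of a split-half computation. It assumes an odd-even split of
the items and a Spearman-Brown step-up to full test length; the studies cited do
not necessarily use this split, and the item scores are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    # Product-moment correlation between two equally long score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(item_scores):
    # item_scores: one list of per-item scores for each examinee.
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson(odd, even)
    # Spearman-Brown correction from half-test to full-test length.
    return 2 * r_half / (1 + r_half)

# Hypothetical probability-test item scores: five examinees, six items.
scores = [
    [1.0, 0.8, 0.9, 1.0, 0.7, 0.9],
    [0.4, 0.5, 0.2, 0.6, 0.3, 0.4],
    [0.9, 0.6, 0.8, 0.7, 1.0, 0.8],
    [0.2, 0.3, 0.1, 0.2, 0.4, 0.1],
    [0.6, 0.7, 0.5, 0.4, 0.6, 0.7],
]
reliability = split_half_reliability(scores)
print(round(reliability, 3))
```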

Rippey (1968) applied the logarithmic and spherical scoring rules to the
same set of probability responses on a variety of tests and computed the test
reliabilities. In comparing these reliabilities he noted that automatic
increases in reliability were not found. However, it must be noted that the people
involved had no experience with probability testing and, from the "stereotypical
student responses" observed, probably would not have passed even a minimum
criterion of a good assessor in the normative sense. Another drawback is that, as
Winkler (1967b) and Shuford et al. (1966) point out, it is important for the
person to know and understand the methods being used. In particular, the scoring
rules may not yield consistent results when applied to the same expressed
probabilities. Shuford et al. (1966), as well as Winkler and Murphy (1968), note
that the logarithmic rule is concerned only with the probability assigned to the
correct answer, while the spherical and quadratic are concerned with all
of the expressed probabilities. However, even these two rules weight the
probabilities in different ways. Winkler and Murphy (1968) give a numerical example
in which the logarithmic rule yields a higher score for assessor A than assessor
B, but if the spherical or quadratic rule is used for the same probabilities
assessor B is given a higher score than assessor A. This fact could, indeed,
affect the reliability and validity of a test depending upon the scoring rule
used. They temper this finding somewhat by noting that they have "evidence that
rankings based upon average scores will be reasonably consistent" (p. 7136).

Rippey (1970, 1971) reports on another study he completed on the reliability of
five different scoring rules. The 1970 reference is a journal article while
the 1971 reference is the final report for the USOE grant. The experimental
setup was essentially the same as in the 1968 study, in that it involves naive
subjects and applies five scoring rules to the same expressed probabilities.
The fact that people might and probably should respond differently under
different scoring rules was not considered. The probabilities that the subjects
were allowed to use were limited to simple ninths, i.e. 0, 1/9, 2/9, ..., 8/9, 1.0.
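
A reversal of this kind is easy to reproduce. The sketch below uses the common
textbook forms of the three rules and two hypothetical assessors for a
three-option item; the numbers are illustrative assumptions and are not those of
Winkler and Murphy (1968).

```python
from math import log, sqrt

def logarithmic(r, correct):
    # Depends only on the probability given to the correct option.
    return log(r[correct])

def quadratic(r, correct):
    # Depends on every expressed probability through the sum of squares.
    return 2 * r[correct] - sum(x * x for x in r)

def spherical(r, correct):
    # Correct-option probability divided by the length of the whole vector.
    return r[correct] / sqrt(sum(x * x for x in r))

# Hypothetical assessments; the first option is keyed correct.
assessor_a = (0.50, 0.50, 0.00)
assessor_b = (0.45, 0.275, 0.275)
correct = 0

for name, rule in [("logarithmic", logarithmic),
                   ("quadratic", quadratic),
                   ("spherical", spherical)]:
    a, b = rule(assessor_a, correct), rule(assessor_b, correct)
    print(f"{name:12s} A={a:.3f} B={b:.3f} higher: {'A' if a > b else 'B'}")
```

Assessor A places more probability on the correct option and so wins under the
logarithmic rule, while assessor B spreads the remaining probability evenly and
is favored by the two rules that take the full distribution into account.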

In his 1970 reference he recommends using the scoring rule $S = r_k$ (see Michael
(1968)) since it yields the consistently highest reliabilities, although the
"Euclidean" rule produced "comparably high reliability." In his (1971) reference
Rippey tempers the recommendation for using $S = r_k$ by noting the objection raised
above with respect to Michael's (1968) article, and by the fact that his
subjects were naive. If a person is aware of and applies the optimal strategy for
this scoring rule, it undermines the whole procedure.

The above-mentioned literature indicates that a considerable amount of
theoretical work has been done. The empirical studies, at least, indicate
the feasibility of trying to implement probability testing. Also, some of the
studies suggest areas in need of more research, and it is possible to
extrapolate other problem areas from the literature.

Areas for Further Research

This section of the paper will attempt to list some of the areas for further
research that have been identified by the author and others.

Much of the literature reviewed above stresses the importance of training
and experience with probability testing in the "normative" sense before it can
be used in the "substantive" sense. Some of the research reports mention
attempts to familiarize students with the scoring rules through hypothetical
examples, etc. (Hambleton et al., 1970; Hansen, 1971). However, one could
classify these attempts as orientation rather than deliberate training, in
the rigorous sense, with a test for mastery, retention, etc. Phillips (1970)
mentions some variables that should be examined in trying to develop training
programs in probability assessment and probability testing. In a related
context Novick (ACT Technical Bulletin No. 5, no date) has suggested the use of
an interactive computer as a strategy for the training of naive people in the
area of Bayesian statistical analysis. Rippey (1971) suggests the use of a
computer to supply the necessary feedback when using probability testing.

One of Phillips' (1970) variables was "personality," and it is also one of
the psychological variables needing further study mentioned by Winkler and
Murphy (1968, 1970) and de Finetti (1970). Literature concerning personality
characteristics associated with subjective probability, risk taking, and decision
making is reviewed by Brichacek (1970), Slovic (1964), and Kogan and Wallach
(1967).

Winkler and Murphy's (1968) idea of partitioning assessors into "goodness"
categories needs to be extended. One question of interest would be how
"goodness" in the normative sense affects the reliability and validity of tests.
Closely associated with this is the need for further work in sensitivity analysis
(see Wagner, 1969) to see how much "expertise" in probability assessment is
really needed. They suggest that the sensitivity question may also be related
to psychological factors. Much of the experimental work in probability testing
has restricted the examinee to limited probability points such as twentieths,
tenths, or ninths. Are these too restrictive for probability testing to be
effective?

Certainly work needs to be done in developing the appropriate psychometrics
and test theory to make use of the additional information supplied by
probability testing (Shuford et al., 1966). Since the various scoring rules use the
expressed probabilities in different ways, in what testing situations should
different scoring functions be used? Also, should different procedures be
developed for evaluating item discrimination and difficulty? de Finetti (1970)
suggests looking at the distribution of probabilities given to the same events
by different individuals or groups of individuals. He also suggests that
individual scores be compared with the score of a fictitious person "who adopts as
his subjective probabilities for each event the average probability given to
this event by a group or subgroup. It often happens that this fictitious player
is near the top of the performance range" (p. 142).
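
The fictitious player can be constructed directly from a set of responses. The
following is a minimal sketch, assuming the quadratic rule and a hypothetical
group of four examinees on a single four-option item. Because the quadratic
score is concave in the reported probabilities, the fictitious player can never
score below the group's mean score; how close to the top it falls depends on the
data.

```python
def quadratic(r, correct):
    # Quadratic rule: 2 * r_correct minus the sum of squared probabilities.
    return 2 * r[correct] - sum(x * x for x in r)

# Hypothetical group: each row is one examinee's probabilities for a
# four-option item; option 0 is keyed correct.
group = [
    (0.70, 0.10, 0.10, 0.10),
    (0.20, 0.60, 0.10, 0.10),
    (0.40, 0.20, 0.20, 0.20),
    (0.10, 0.10, 0.70, 0.10),
]
correct = 0
n = len(group)

# The fictitious player reports the group-average probability for each option.
fictitious = tuple(sum(row[j] for row in group) / n
                   for j in range(len(group[0])))

individual = [quadratic(r, correct) for r in group]
fictitious_score = quadratic(fictitious, correct)
mean_individual = sum(individual) / n

print("individual scores:", [round(s, 3) for s in individual])
print("mean of individuals:", round(mean_individual, 3))
print("fictitious player:", round(fictitious_score, 3))
```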

The implications of non-linear utility functions need more theoretical as
well as experimental work. Sensitivity analysis is also applicable here. How
much does the utility function have to deviate from linearity before the
expressed probabilities should be shifted from their true values? If we are
forcing students into situations necessitating non-linear utility, then should
we even be using objective tests no matter how they are administered?
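
The size of such a shift can be explored numerically. The sketch below is only
an illustration under stated assumptions: a two-option item, a quadratic-type
score rescaled to run from 0 to 1, and a hypothetical concave (risk-averse)
utility 1 - exp(-a * s). None of these choices comes from the studies cited.

```python
from math import exp

def score_if_first_correct(r):
    # r is the probability reported for the first option.
    return 1 - (1 - r) ** 2

def score_if_second_correct(r):
    # The second option received probability 1 - r.
    return 1 - r ** 2

def linear_utility(s):
    return s

def risk_averse_utility(s, a=3.0):
    # Hypothetical concave utility; larger `a` departs further from linearity.
    return 1 - exp(-a * s)

def best_report(p, utility, steps=1000):
    # Grid search for the report that maximizes expected utility when the
    # examinee's true probability for the first option is p.
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda r: p * utility(score_if_first_correct(r))
               + (1 - p) * utility(score_if_second_correct(r)))

p_true = 0.7
r_linear = best_report(p_true, linear_utility)
r_averse = best_report(p_true, risk_averse_utility)
print("linear utility     :", r_linear)
print("risk-averse utility:", r_averse)
```

With linear utility the honest report is optimal; with the concave utility the
optimal report moves toward .5, so the expressed probability understates the
examinee's true belief.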

Replications of previous experimental studies with improved procedures
should be carried out. As Hambleton et al. (1970) say, "Hopefully, other
investigators will be stimulated by the inadequacies of the present results to
apply the methodology outlined here to investigate what is an important problem
in the area of testing" (p. 81).






References



Alberoni, F. Contribution to the study of subjective probability: Prediction.
II. Journal of General Psychology, 1962, 66, 265-285.

Bakan, D. On Method: Toward a Reconstruction of Psychological Investigation.
San Francisco: Jossey-Bass, Inc., 1967.

Brichacek, V. Use of subjective probability in decision making. Acta
Psychologica, 1970, 34, 241-253.

Brunswik, E. Organismic achievement and environmental probability.
Psychological Review, 1943, 50, 255-272.

Coombs, C.H. On the use of objective examinations. Educational and
Psychological Measurement, 1953, 13, 308-310.

Coombs, C.H., Milholland, J.E. and Womer, F.B. The assessment of partial
knowledge. Educational and Psychological Measurement, 1956, 16, 13-37.

de Finetti, B. Does it make sense to speak of 'good probability appraisers'?
In: I.J. Good (ed.), The Scientist Speculates: An Anthology of Partly-Baked
Ideas. London: Heinemann, 1962, 257-364.

de Finetti, B. Methods for discriminating levels of partial knowledge
concerning a test item. British Journal of Mathematical and Statistical
Psychology, 1965, 18, 87-123.

de Finetti, B. Logical foundations and measurement of subjective probability.
Acta Psychologica, 1970, 34, 129-145.

Ebel, R.L. Confidence weighting and test reliability. Journal of Educational
Measurement, 1965a, 2, 49-57.

Ebel, R.L. Measuring Educational Achievement. New Jersey: Prentice-Hall,
1965b.

Echternacht, G.J. The use of confidence testing in objective tests. Review
of Educational Research, 1972, 42, 217-236.

Good, I.J. Probability and the Weighing of Evidence. New York: Hafner, 1950.

Good, I.J. Estimation of Probabilities: An Essay on Modern Bayesian
Methods. Cambridge: MIT Press, 1965.

Hambleton, R.K., Roberts, D.M. and Traub, R.E. A comparison of the reliability
and validity of two methods for assessing partial knowledge on a multiple-choice
test. Journal of Educational Measurement, 1970, 7, 75-82.

Hansen, R. The influence of variables other than knowledge on probabilistic
tests. Journal of Educational Measurement, 1971, 8, 9-14.






Hevner, K. A method of correcting for guessing in true-false tests and
empirical evidence in support of it. Journal of Social Psychology, 1932,
2, 359-362.

Hilgard, E.R. The role of learning in perception. In: R. Blake and G. Ramsey
(eds.), Perception: An Approach to Personality. New York: Ronald Press,
1951.

Jeffreys, H. Theory of Probability (3rd ed.). Oxford: Clarendon Press, 1961.

Kogan, N. and Wallach, M.A. Risk taking as a function of the situation, the
person and the group. In: New Directions in Psychology III. New
York: Holt, Rinehart and Winston, 1967, 111-273.

Lord, F. and Novick, M. Statistical Theories of Mental Test Scores. Reading:
Addison-Wesley, 1968.

Michael, J.J. The reliability of a multiple-choice examination under various
test-taking instructions. Journal of Educational Measurement, 1968,
5, 307-314.

Murphy, A.H. and Winkler, R.L. Scoring rules in probability assessment and
evaluation. Acta Psychologica, 1970, 34, 273-286.

Novick, M.R. Bayesian computer-assisted data analysis. ACT Technical
Bulletin, No. 3, no date.

Phillips, L.D. The 'true probability' problem. Acta Psychologica, 1970, 34,
254-264.

Rippey, R.M. Probabilistic testing. Journal of Educational Measurement,
1968, 5, 211-215.

Rippey, R.M. A comparison of five different scoring functions for confidence
tests. Journal of Educational Measurement, 1970, 7, 165-170.

Rippey, R.M. Scoring and Analyzing Confidence Tests. Final report of project
no. 7-0578, U.S. Department of Health, Education and Welfare, 1971.

Roby, T.B. Belief states and the uses of evidence. Behavioral Science,
1965, 255-270.

Shuford, E.H., Albert, A. and Massengill, H.E. Admissible probability
measurement procedures. Psychometrika, 1966, 31, 125-145.

Shuford, E.H. and Massengill, H.E. SCoRule Response Aid Instructions.
Lexington: Shuford-Massengill Corporation, 1970.

Slovic, P. Assessment of risk-taking behavior. Psychological Bulletin, 1964,
61, 220-233.

Soderquist, H.O. A new method of weighting scores in a true-false test.
Journal of Educational Research, 1936, 30, 290-292.






Stael von Holstein, C. The assessment of discrete subjective probability
distributions (an experimental study). University of Stockholm, Institute
of Mathematical Statistics, Research Report No. 'J'L, 1969.

Stael von Holstein, C. Measurement of subjective probability. Acta
Psychologica, 1970a, 34, 146-159.

Stael von Holstein, C. An experiment in probabilistic weather forecasting.
(Unpublished manuscript), University of Stockholm, Institute of
Mathematical Statistics, 1970b.

Toda, M. Measurement of Subjective Probability Distributions. ESD-TDR-63-407.
Bedford: Decision Science Laboratory, Hanscom Field, 1963.

van Naerssen, R.F. A scale for the measurement of subjective probability.
Acta Psychologica, 1962, 159-166.

Wagner, H.M. Principles of Operations Research. New Jersey: Prentice-Hall,
1969.

Winkler, R.L. The assessment of prior distributions in Bayesian analysis.
Journal of the American Statistical Association, 1967a, 62, 775-800.

Winkler, R.L. The quantification of judgment: some methodological
suggestions. Journal of the American Statistical Association, 1967b, 62,
1105-1120.

Winkler, R.L. The quantification of judgment: some experimental results.
Proceedings of the American Statistical Association, 1967c, 386-395.

Winkler, R.L. Scoring rules and the evaluation of probability assessors.
Journal of the American Statistical Association, 1969, 64, 1073-1078.

Winkler, R.L. and Murphy, A.H. 'Good' probability assessors. Journal of
Applied Meteorology, 1968, 7, 751-753.

Winkler, R.L. and Murphy, A.H. Nonlinear utility and the probability score.
Journal of Applied Meteorology, 1970, 9, 143-148.






Appendix A 



Introduction to Confidence Testing: An Annotated Bibliography

Janice Richman, Charles Stegman and Nancy Borg
Department of Educational Research
University of Pittsburgh, Pittsburgh, Pennsylvania 15213

The following annotated bibliography has been included to introduce potential
researchers to the general area of confidence testing. This is only part of
a more comprehensive bibliography that is currently being compiled. Copies
may be obtained by writing to the authors.



Alberoni, F. Contribution to the study of subjective probability. I. Journal
of General Psychology, 1962, 66, 241-264.

This is an attempt to determine the psychological meaning of probability.
The concepts investigated include the idea of probability and
independence. Subjective probability differs from mathematical probability
when cause, rather than chance, is suspected to be operating.
This may be posited when an order or pattern of some kind emerges in
the course of a sample. Another difference is that subjects interpret
the probability of a sequence as the probability of that outcome. The
subjects are not always coherent.

Alberoni, F. Contribution to the study of subjective probability: Prediction.
II. Journal of General Psychology, 1962, 66, 265-285.

The psychological processes governing probabilistic prediction are
studied. When subjects were asked to supply the next outcome of a
sequence of red and blue beads, with an equal number of each color,
they used one of three strategies: randomly generating the next outcome
with an equal probability of selecting either color, respecting
the cyclic nature of the sequence, or formally improving the sequence.
The latter improvement assumes that the colors in the sequence will
alternate in an irregular way. A fourth factor was added when an
unequal proportion of the two colors was presented: the quantitative
improvement of the sequence. This strategy implies the outcome which
best helps the colors in the sequence reflect the proportion in the
universe.

Atkinson, J.W., Bastian, J.R., Earl, R.W. and Litwin, G.H. The achievement
motive, goal setting, and probability preferences. Journal of Abnormal
and Social Psychology, 1960, 60, 27-36.

Need for achievement was related to preferences for certain probabilities
in a risk-taking model. Those high in need for achievement
preferred more intermediate subjective probabilities than those low
in need for achievement, who preferred to set themselves goals with very
high (easy shots) or very low (difficult shots) probabilities in an
effort to avoid failure. The subjective probabilities were measured in
two situations: a shuffleboard game, in which subjects could choose
their distance (here, subjective probability was measured geographically!)
and an imaginary betting situation. The preferences did not hold in
all the betting situations but only in those with a small monetary
reward ( .

Boldt, R.F. A simple confidence testing format. ETS Research Bulletin No.
71-42, 1971. ERIC No. ED 03(5 09B.

ERIC Summary: "This paper presents the development of scoring functions
for use in conjunction with standard multiple-choice items. In addition
to the usual indication of the correct alternative, the examinee is to
indicate his personal probability of the correctness of his response.
Both linear and quadratic polynomial scoring functions are examined for
suitability, and a unique scoring function is found such that a score
of zero is assigned when complete uncertainty is indicated and such
that the examinee can expect to do best if he reports his personal
probability accurately. A table of simple integer approximations to the
scoring function is supplied."

Boldt, R.F. An approximately reproducing scoring scheme that aligns random
response and omission. ETS Research Bulletin No. 71-43, 1971. ERIC
No. ED 057 074.

ERIC Summary: "One formulation of confidence scoring requires the
examinee to indicate as a number his personal probability of the
correctness of each alternative in a multiple-choice test. For this
formulation, a linear transformation of the logarithm of the correct response
is maximized if the examinee reports accurately his personal probability.
To equate omits scores with choice scores, the transformation can be
chosen so that the score is zero if the examinee indicates complete
uncertainty. If this is done, the scoring function depends on the
number of alternatives. One could also align uncertainty and response
omission by granting credit for omitting items, though it is felt this
might be hard to explain to examinees."
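
One way to picture the alignment described in this summary is sketched below. It
assumes the linear transformation a + b log(p) is fixed by two conditions: a
uniform report scores zero, and full confidence in the correct alternative
scores one. The second condition is an added normalization for illustration and
is not stated in the summary; the function name is hypothetical.

```python
from math import log

def aligned_log_score(p_correct, n_options):
    # Linear transformation of log(p): equals 0 when p = 1/n_options
    # (complete uncertainty) and 1 when p = 1 (assumed normalization).
    return 1 + log(p_correct) / log(n_options)

for k in (2, 4, 5):
    uniform_is_zero = abs(aligned_log_score(1 / k, k)) < 1e-12
    print(k, uniform_is_zero, round(aligned_log_score(0.9, k), 3))
```

The same reported probability earns a different score as the number of
alternatives changes, which is the dependence on the number of alternatives
that the summary mentions.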

Cameron, B. and Myers, J.L. Some personality correlates of risk-taking.
Journal of General Psychology, 1966, 74, 51-60.

The relationships between betting preferences and need states as well
as other personality variables are investigated. The betting situation
follows the paradigm originated by Ward Edwards, and the Edwards
Personal Preference Schedule is the instrument used to measure the
personality variables. Betting preferences were measured in both imaginary
and actual risk-taking situations, in that order. As in several of
Edwards' experiments, probability preferences are confounded with payoff
preferences. Subjects high in exhibition, aggression, or dominance
tended to prefer bets with high payoff and low probability of winning,
while subjects high in autonomy or endurance tended to be more
conservative. It is not clear that these five needs on the EPPS are in any way
similar to need for achievement as measured by Atkinson et al. (1960).






Coombs, C.H. On the use of objective examinations. Educational and
Psychological Measurement, 1953, 13, 308-310.

A procedure for administering and scoring objective tests so as to
provide a scale from complete misinformation through several degrees of
partial information is proposed (Coombs type directions). Individuals
should be instructed to cross out all the alternatives they consider
to be wrong but not to guess among the remaining options. The weights
used in the scoring procedure are as follows: one point is added for
each wrong alternative crossed out; k-1 points are subtracted if the
right alternative is crossed out (k is the number of options).
Advantages of this scoring method are suggested.
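
The scoring weights described in this annotation can be written out as a short
function. This is a minimal sketch of the rule as summarized above; the function
name and the example responses are hypothetical.

```python
def coombs_item_score(crossed_out, correct, n_options):
    # One point for each wrong alternative crossed out.
    score = sum(1 for option in crossed_out if option != correct)
    # k - 1 points subtracted if the right alternative is crossed out.
    if correct in crossed_out:
        score -= n_options - 1
    return score

k = 4  # number of options; option 0 is keyed correct
full_information = coombs_item_score({1, 2, 3}, correct=0, n_options=k)
partial_information = coombs_item_score({1, 2}, correct=0, n_options=k)
no_response = coombs_item_score(set(), correct=0, n_options=k)
misinformation = coombs_item_score({0}, correct=0, n_options=k)
print(full_information, partial_information, no_response, misinformation)
```

For a four-option item the possible scores therefore run from -3 (only the right
alternative crossed out) to +3 (all three wrong alternatives crossed out).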

Coombs, C.H., Milholland, J.E. and Womer, F.B. The assessment of partial
knowledge. Educational and Psychological Measurement, 1956, 16, 13-37.

This study compared conventional test scoring with the scoring procedure
outlined by Coombs (1953) in terms of reliabilities, validities and
coefficients of discrimination. Positive scores for each item represent
some degree of partial information, while negative scores represent some
degree of misinformation. Results indicate that examinees with less than
complete information on a given subject may have considerable partial
information and that this may be used as a valid basis for discriminating
among them. The reliabilities were higher for tests administered and
scored by the experimental method. This reliability was even further
increased for more difficult tests. Both types of scoring appear to be
equally valid. What constitutes a good discriminating item is the same
for both methods.

Coombs, C.H. and Pruitt, D.G. Components of risk in decision making:
Probability and variance preferences. Journal of Experimental Psychology, 1960,
60, 265-277.

An alternative to Ward Edwards' theory of maximization of subjectively
expected utility is proposed. This model involves variance preferences,
as well as probability, skewness and expectation preferences. An
experimental betting situation supports the hypothesis that variance
preferences exist and can be generated by folding a joint scale. However, for
each set of variance preferences, a nonlinear utility function of money
can be found which explains the ordering equally well. Skewness
preferences were also found to exist. One conclusion was that the subjects are
inconsistent in their preferences.

Dale, H.C.A. A study of subjective probability. British Journal of Statistical
Psychology, 1960, 13, 19-29.

Adult subjects' predictions of how a small number of items would be
selected by chance from a larger set were compared to the objective
probabilities. The subjects appeared to avoid unlikely configurations but
did not consider all the aspects of the selection process that the
authors had determined were important a posteriori. Three aspects of
configurations were chosen for consideration: range, bunching and
symmetry. None of the models proposed seemed to adequately describe the
subjects' behavior.






Davis, F.B. Estimation and use of scoring weights for each choice in
multiple-choice test items. Educational and Psychological Measurement, 1959, 19,
291-298.

If the options of a multiple-choice item are to be weighted according
to their degree of correctness, the appropriate weights remain to be
determined. By administering the items to a large representative sample,
a scoring weight for each item option can be found that is linearly
related to the average score on the criterion variable of those in the
tryout sample who selected that choice. Since direct computation of the
average criterion score for the group selecting each option is very time
consuming, a method of estimating the criterion score means is given in
tabular form, requiring only the percent of those in the upper 27% and
in the lower 27%, respectively, who selected the given option. The
estimated means were found to produce moderately reliable weights and very
close to the weights calculated by the actual criterion-score means.

Davis, F.B. and Fifer, G. The effect on test reliability and validity of
scoring aptitude and achievement tests with weights for every choice.
Educational and Psychological Measurement, 1959, 19, 159-170.

It was found that scoring an arithmetic reasoning test by weighting the
options according to their degree of correctness was more reliable than
conventional scoring. The validity of the test was unaffected. Weights
were assigned in three ways: a priori weights were determined
independently by two mathematicians, empirical weights were obtained by using
a function of the average criterion scores of those selecting each choice
for a previous group of examinees who took the test scored by a priori
weights, and modified empirical weights were approximated from the
scores of the upper and lower 27% of the previous sample. A priori
weights seem to be a necessary feature in determining the subsequent
empirical weights. Otherwise both kinds of empirical weights may actually
be based on differential appeal of the wrong options rather than degree
of correctness and thus may not be assessing partial knowledge.

de Finetti, B. Methods for discriminating levels of partial knowledge concerning
a test item. British Journal of Mathematical and Statistical Psychology,
1965, 18, 87-123.

In the absence of complete information a person should be encouraged to
attach a probability to each alternative. This probability should
correspond to the individual's degree of belief as to the correctness of
that alternative. Other answering techniques are discussed, including
Coombs' type of directions. All techniques are interpreted geometrically.
Subjective probability leads to a scoring system that makes sense, unlike
the rank ordering or crossing out of a number of wrong alternatives.
Training in the use of a suitably selected technique is recommended. A
strong case is made for assessing and utilizing partial knowledge in
scoring multiple-choice questions.




de Finetti, B. Logical foundations and measurement of subjective probability.
Acta Psychologica, 1970, 34, 129-145.

Subjective probability is considered the only meaningful interpretation
of probability. Terms involving properties associated with objective
probabilities, such as event and stochastic independence, should be
avoided. Probability is degree of belief and must be operationally
defined by some device such as offering a suitable set of bets, fixing a
penalty, or introducing an opponent. Probabilities must be consistent
to be admissible; however, logical or empirical considerations may
suggest further restrictions. Scoring rules are briefly discussed. Ten
psychological criteria for evaluating assessors are outlined. Recourse
to concepts of "objective probability" is examined and rejected.

Dressel, P.L. and Schmid, J. Some modifications of multiple choice items.
Educational and Psychological Measurement, 1953, 13, 574-595.

Five scoring methods are compared in terms of their reliabilities: free
choice, in which any number of options can be selected only one of which
is correct; degree of certainty, in which the student marks how certain
he/she was about the option selected on a scale of 1 to 4; multiple answer,
in which any number of options can be selected and more than one option
may be correct; two-answer, in which two options are correct; and a
conventional test. The highest reliability was found for the
multiple-answer test; the two-answer and degree of certainty tests had slightly
higher reliabilities than the conventional test.

Ebel, R.L. Confidence weighting and test reliability. Journal of Educational
Measurement, 1965, 2, 49-57.

A system of confidence-weighted response and scoring was developed for
true-false test items. A justification for the use of the true-false
format in high quality tests of educational achievement is given.
Previous data had shown that tests weighted by confidence had significantly
higher reliabilities than conventional tests. Recent data, however,
showed a negligible increase in reliability for the weighted scoring, not
enough to justify the more complicated technique. Simulating a set of
responses and scoring by weighted and conventional techniques suggests
that confidence weighting should only be applied to those items with a
higher than chance probability of a correct response (the criterion used
in the simulation was two-thirds).

Ebel, R.L. Review of "Valid confidence testing: demonstration kit." Journal
of Educational Measurement, 1968, 5, 353-354.

This is a review of Shuford-Massengill materials for Valid Confidence
Testing, which include: SCoRule response aid, answer sheets, a scoring
table, and a class analysis form. The process seems complex, and the
costs seem high. Indirect evidence as to the indicated degrees of
confidence being related to the proportion of correct answers is given. Valid
confidence scores correlate substantially, but not perfectly, with
conventional scores. There is only incomplete support for Shuford and
Massengill's claims of increased reliability and validity.






Echternacht, G. et al. User's handbook for confidence testing as a diagnostic
aid in technical training. ETS Report No. PR-71-12, 1971. ERIC No.
ED 055 I

ERIC Summary: "This handbook presents instructions for implementing a
confidence testing program in technical training situations,
identification of possible areas of application, techniques for evaluating
confidence information, advantages and disadvantages of confidence testing,
time considerations, and problem areas. Complete instructions for
"Pick-One" and "Distribute 100 Points" confidence testing methods are given
for testing supervisors and examinees for both hand and computer scoring."

Echternacht, G. The use of confidence testing in objective tests. ETS Research
Bulletin No. 71-41, 1971. ERIC No. ED' Ol'iO 307.

ERIC Summary: "Confidence testing has been used in varying forms over the
past 40 years as a method for increasing the amount of information
available from objective test items. This paper traces the development of
the procedure from Hevner's beginning method up to the various methods in
use today and describes both the testing procedures and scoring methods
used. The term confidence testing is applied to both probabilistic
testing and confidence weighting procedures. Various procedures are
presented and their relationship with personality factors discussed."

Echternacht, G.J. The use of confidence testing in objective tests. Review of
Educational Research, 1972, 42, 217-236.

Various forms of confidence testing are described and evaluated. In
spite of Jacobs' distinction (1971) between confidence weighting and
probabilistic testing, they are here subsumed under one rubric, that
of confidence testing. The sole use of the criterion of increasing
reliability in evaluating confidence testing is criticized.

Garvin, A.D. Confidence weighting. Paper presented at the annual meeting of
the American Educational Research Association, 1972. ERIC No. ED 062 401.

ERIC Summary: "Various aspects of Confidence Weighting are examined.
Variants of Confidence Weighting, its effect on test reliability, and the
validity of Confidence Weighting are discussed."

Hambleton, R.K., Roberts, D.M. and Traub, R.E. A comparison of the reliability
and validity of two methods for assessing partial knowledge on a mul-
tiple-choice test. Journal of Educational Measurement, 1970, 7, 75-82.

Three groups were compared on the basis of different instructions and
scoring methods: conventional method, differential weighting of dis-
tractors according to the degree of correctness (determined by 22 experts),
and confidence testing using an answer graph and a reproducing loga-
rithmic scoring function. Confidence testing was most valid and least
reliable. Validity was determined by correlating the scores with midterm
scores. Reliability was estimated from corrected split-half correlations,
a method that has been considered by some to be inappropriate for confi-
dence testing. Two sets of differential weights were developed from the
experts' ranking, one considerably more complex than the other. The more
complex weights were more valid and less reliable than the simpler weights.
The simpler weights were as reliable as conventional scoring and more valid.
A more difficult test might have proved more informative.






Hansen, R. The influence of variables other than knowledge on probabilistic
tests. Journal of Educational Measurement, 1971, 8, 9-14.

Individuals who take examinations using a probabilistic scoring system
display a relatively stable tendency which cannot be accounted for on
the basis of their stability of knowledge. The tendency of an individual
to show certainty was determined from a function of the probabilities
assigned to the options. This measure is highest where certain options
are assigned probability of 1 and lowest when the probabilities are
equally distributed among the options. The test score was computed using
the spherical scoring function. The correlation between the measures of
certainty for two successive exams was .102. The correlations between
test score and the measure of certainty were low. On the other hand, this
tendency correlated positively with Kogan and Wallach's measure of risk-
taking, the Choice Dilemma Questionnaire, and negatively with the F-scale.
Both correlations were moderate (less than .42).

Hopkins, K.D., Hakstian, A.R., and Hopkins, B.R. Validity and reliability con-
sequences of confidence weighting. Educational and Psychological Measure-
ment, 1973, 33, 135-141.

Confidence weighting studies are summarized in tabular form and are shown
to have resulted generally in somewhat higher reliabilities. Three studies
using subjective probability are subsumed under the C.W. rubric. The
gain in reliability is hypothesized to be a result of a gambling response
style, or irrelevant source, in which case a decrease in validity might
occur. A final exam was administered with confidence weights of the
form high, medium and low. An item score could range from -3 to +3. A
short answer exam on the same material provided the validity criterion.
Conventional scoring resulted in slightly higher validity and lower re-
liability than confidence weighted scoring. The authors conclude that
the added variance in the confidence weighting studies may be irrelevant
response style variance since validity was not increased.

Liverant, S. and Scodel, A. Internal and external control as determinants of
decision-making under conditions of risk. Psychological Reports, 1960,
7, 59-67.

Internal versus external control is found to be another personality vari-
able entering into making risky decisions. Internal-external control is
a construct which depends on whether an individual categorizes desirable
and/or undesirable items as within or beyond his control. The I-E scale
used is an extension of work done by Comes (1957). A betting situation in
which individuals can choose between bets differing in pay-off confounded
with probability was set up. It was hypothesized that internally con-
trolled persons would tend to employ a strategy which would attempt to
maximize the number of favorable outcomes. Externally controlled people
would be disposed to select bets more subjectively, on the basis of
"hunches" or the outcome of previous trials. The I's did choose more
intermediate and fewer low probability bets than the E's. Significantly
more I's than E's never selected an extreme high or low probability bet.
The amount of money wagered on safe, as opposed to risky, bets was greater
for I's.






Marschak, J. Actual versus consistent decision behavior. Behavioral Science,
1964, 9, 102-110.

General hypotheses of decision behavior are suggested to explain how
people make decisions when the problem is too complex for them to ap-
ply the utility principle. These hypotheses include "rational" or "con-
sistent" behavior, learning theory, stochastic decision theory, applying
Gestalt theory, and the effect of training. Experiments are proposed to
determine whether subjects are applying the principles of expected utility,
namely consistency, admissibility, independence.

Michael, J.J. The reliability of a multiple-choice examination under various
test-taking instructions. Journal of Educational Measurement, 1968,
5, 307-314.

The reliabilities and standard errors of measurement were compared for
three methods of scoring the same test: conventional scoring, the number
right corrected for guessing, and confidence weighting. In the confi-
dence weighting method ten points were to be distributed among the four
alternatives. That method had the highest reliability and lowest stan-
dard error of measurement of the three. The reliabilities broken down by
sex and IQ were only slightly different under confidence weighting.

Murphy, A.H. and Epstein, E.S. Verification of probabilistic predictions: a
brief review. Journal of Applied Meteorology, 1967, 6, 748-755.

The evaluation process is defined as one consisting of several ordered
steps. The first step is to identify the purposes of evaluation, which
in this article lead to distinguishing between two forms of evaluation:
operational evaluation, which is concerned with the value to the user of
probabilistic predictions, and empirical evaluation or verification,
which is concerned with how closely the predictions correspond to actual
observations. Desirable properties for empirical evaluation are enumera-
ted as perfection and unbiasedness and compared with terminology adopted
by other authors. Seven measures or scores of the properties are consi-
dered, including probability scores, information ratios, and distance
measures. Two prediction systems are compared on the basis of different
measures.

Murphy, A.H. and Winkler, R.L. Scoring rules in probability assessment and
evaluation. Acta Psychologica, 1970, 34, 273-286.

Scoring rules are discussed in the contexts of probability assessment,
in which the expected scores are of interest, and evaluation, in which the
"goodness" of the probabilities should be measured. Scoring rules to
be used in assessment should encourage the assessor to be honest in re-
porting probabilities. If the assessor has a linear utility function,
scoring rules should be sensitive to deviations of expected scores from
the probability judgments. Four scoring functions (logarithmic, qua-
dratic, spherical and ranked probability score) were compared in a few
cases as to sensitivity. No conclusions as to which was most sensitive
could be drawn, although the logarithmic function appeared least sensi-
tive. With a nonlinear utility function which is unknown and cannot
be incorporated into the scoring rule, the assessor's statements may dif-
fer from the assessor's actual judgments. Several frameworks for eval-
uation were described. From the inferential viewpoint validity, or the
association between the probability statements and the actual outcomes, was
most important. Roberts' Bayesian model using likelihood ratios was
mentioned, as were decision theoretic frameworks.
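
As a rough illustration of three of the scoring functions named in this
annotation, the sketch below uses commonly cited textbook forms of the
logarithmic, quadratic, and spherical rules. The annotation itself gives no
formulas, so the scaling and orientation (higher is better) and all function
names are assumptions; the ranked probability score is omitted.

```python
import math

def logarithmic_score(probs, correct):
    """Log of the probability placed on the option that turned out correct."""
    return math.log(probs[correct])

def quadratic_score(probs, correct):
    """Quadratic score, oriented so that a higher value is better."""
    return 2 * probs[correct] - sum(p * p for p in probs)

def spherical_score(probs, correct):
    """Probability on the correct option divided by the length of the vector."""
    return probs[correct] / math.sqrt(sum(p * p for p in probs))

# Hypothetical examinee: three options, option 0 is keyed correct.
report = [0.7, 0.2, 0.1]
for rule in (logarithmic_score, quadratic_score, spherical_score):
    print(rule.__name__, round(rule(report, 0), 3))
```

Note that the logarithmic form uses only the probability given to the correct
option, while the other two depend on the whole reported vector.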

Pascale, P. Innovation in item scoring procedures, 1971, ERIC No. ED 056 096.

ERIC Summary: "This brief review explains some alternate scoring pro-
cedures to the classical method of summing correct responses. The novel
procedures attempt in some way to retrieve and use even the information
in the wrong responses."

Ramsay, J.O. A scoring system for multiple choice test items. British Journal
of Mathematical and Statistical Psychology, 1968, 21, 247-250.

If the purpose of a multiple-choice test is to classify an individual
into one of two groups, each alternative or option can be weighted by
the differences between the probabilities of selecting that option for
two criterion groups. Scores weighted in this fashion maximize the sepa-
ration between the mean scores of the two criterion groups. The results
are extended to more than two criterion groups. Advantages of this
scoring system are that partial knowledge is taken into account, compu-
tations are minimized, item selection is enhanced, and reliability is
expected to be improved. Disadvantages are that the system does not
imply that misclassifications are minimized and that it may indeed per-
petuate any initial misclassification.
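
The two-group weighting described in this annotation can be sketched as
follows. This is only an illustration under stated assumptions: the option
selection proportions are invented, the function names are hypothetical, and
the total score is taken to be the plain sum of the weights of the options an
examinee chooses.

```python
def option_weights(p_group_a, p_group_b):
    """Weight each option by the difference in selection probability
    between criterion group A and criterion group B."""
    return [a - b for a, b in zip(p_group_a, p_group_b)]

def weighted_score(choices, weights_per_item):
    """Sum the weights of the options chosen, one choice per item."""
    return sum(weights_per_item[i][c] for i, c in enumerate(choices))

# Invented selection proportions for two three-option items.
weights = [
    option_weights([0.6, 0.3, 0.1], [0.2, 0.3, 0.5]),
    option_weights([0.5, 0.25, 0.25], [0.25, 0.25, 0.5]),
]

# An examinee choosing option 0 on both items looks like group A;
# one choosing option 2 on both items looks like group B.
print(weighted_score([0, 0], weights))
print(weighted_score([2, 2], weights))
```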

Rippey, R.M. Probabilistic testing. Journal of Educational Measurement, 1968,
5, 211-215.

Four tests were administered and scored probabilistically to determine
whether increases in reliability would result. Two scoring functions
were used: spherical and logarithmic. An increase in reliability was
observed in the first test coupled with a corresponding increase in ad-
ministration time. Different items would be retained in the probabilis-
tic case on the basis of item analyses. Stereotypical student responses
were observed, indicating that students may have trouble in thinking
probabilistically with respect to more than two classifications. The
probabilistic score correlated lower with an essay test on the same ma-
terial than did conventional scoring. In general, the results were
anomalous.

Rippey, R.M. A comparison of five different scoring functions for confidence
tests. Journal of Educational Measurement, 1970a, 7, 165-170.

Five probabilistic scoring functions were compared on the basis of their
reliabilities. All five functions were applied to the same tests. The
functions were: probability assigned to correct answer, logarithmic,
spherical, Euclidean, and inferred choice. The simplest function, the
probability assigned to the correct choice, proved the most reliable.
Inferred choice, which is equivalent to conventional scoring, was least
reliable. The conclusions were that the simplest and most intuitive
scoring functions were best since they were most likely to correspond to
the expectations of the examinees.






Rippey, R.M. Rationale for confidence-scored multiple-choice tests. Psy-
chological Reports, 1970b, 27, 91-98.

If subject responses related to incomplete information, uncertain know-
ledge, or degree of preference are to be sampled, confidence-scoring
procedures for conventional items or the use of intrinsic items is
recommended. Intrinsic items require a distribution of belief over the
options on a multiple-choice test and do not have unique correct re-
sponses. A Euclidean scoring function scores intrinsic items on the
basis of the distance between the probabilities of the individual and
the criterion group mean for each response. Since items which call for
uniform distributions of confidence over all responses may not discrimi-
nate between the informed and the uninformed, a confidence weight on
the assigned distribution of belief is suggested.

Rippey, R.M. Scoring and analyzing confidence tests. Final report of project
no. 7-0578, U.S. Department of Health, Education and Welfare, 1971.

The literature leading up to and including probabilistic testing is re-
viewed. New features include an entropic scoring function and a Eucli-
dean function weighted by degree of confidence. Three tests with non-
unique correct answers were devised and scored with the weighted and un-
weighted Euclidean functions. Confidence was extensively correlated
with sex, grade, and socioeconomic class.

Roby, T.B. Belief states and the uses of evidence. Behavioral Science, 1965,
10, 255-270.

A new notation called B-state or belief state is introduced to facilitate
updating prior beliefs with current evidence. Advantages of this approach
are that (1) quantitative comparison or combination of the beliefs of
several individuals or one individual at several time periods is possible
and (2) the effects of external evidence can be described as mathematical
operations on the existing belief state. With the necessity for absorb-
ing new notation, it is not clear that the B-state operators are superior
to Bayes' theorem.

Romberg, T. et al. Three experiments involving probability measurement proce-
dures with mathematics test items. Wisconsin Research and Development
Center for Cognitive Learning Report No. TR-129, 1970. ERIC No.
ED 044 315.

ERIC Summary: "This is a report from the Project on Individually Guided
Mathematics, Phase 2 Analysis of Mathematics Instruction. The report
outlines some of the characteristics of probability measurement procedures
for scoring objective tests, discusses hypothesized advantages and disad-
vantages of the methods, and reports the results of three experiments desi-
gned to learn more about the technique and compare it with standard proce-
dures of scoring objective tests. The procedure used required the stu-
dents to specify a degree of belief probability for each of the given al-
ternatives to a question. The students were given a multiple-choice item
and asked to specify what they believed to be the probability of correctness
of each choice. The initial intent of these experiments was to see if a
non-standard test-taking and scoring procedure would provide useful, re-
liable information for such tests. The studies indicated that the problem
of getting useful, reliable information on difficult tests has not been
solved."

Scodel, A., Ratoosh, P. and Minas, J.S. Some personality correlates of decision-
making under conditions of risk. Behavioral Science, 1959, 4, 19-28.

Personality variables are incorporated into the utility-maximization
model. Risk taking was measured in a gambling situation following Ed-
wards' paradigm, in which probability preferences and payoff preferences
were similarly confounded. The college group tended to be more conser-
vative than the military group. Intelligence was inversely related to
variability in risk-taking, but not related to degree of risk-taking. The
group choosing low payoffs had more fear of failure and less need for
achievement than the high or intermediate payoff groups.

Shuford, E.H., Albert, A. and Massengill, H.E. Admissible probability measure-
ment procedures. Psychometrika, 1966, 31, 125-145.

A probabilistic scoring system for objective tests should have the
reproducing property: it allows the student to maximize his/her expected
score if and only if he/she honestly reports the degree-of-belief
probabilities. Necessary and sufficient conditions for the scoring
system to have the reproducing property are stated and proved. A method
is given for generating a class of functions, both symmetric and asym-
metric, possessing the reproducing property. Scoring systems are chosen
which reward intelligent probability assessments: the more probability
placed on the correct option, the higher the score. With a minor modi-
fication the results can be extended to testing situations in which the
student has to generate the answers as well as indicate degree of belief.
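
The reproducing property can be checked numerically on a small case. The
sketch below is an illustration, not the authors' derivation: it assumes a
two-option item, a logarithmic rule as an example of a reproducing function,
and a linear rule (score equals the probability placed on the correct option)
as a contrast. All names are hypothetical.

```python
import math

def log_rule(report, k):
    """Logarithmic score for a report when option k is correct."""
    return math.log(report[k])

def linear_rule(report, k):
    """Score equal to the probability placed on the correct option."""
    return report[k]

def expected_score(rule, belief, stated):
    """Expected score when the true belief in option 0 is `belief`
    but the examinee states `stated`."""
    beliefs = [belief, 1.0 - belief]
    report = [stated, 1.0 - stated]
    return sum(b * rule(report, k) for k, b in enumerate(beliefs))

def best_report(rule, belief):
    """Search stated probabilities 0.01, 0.02, ..., 0.99 for the best one."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda r: expected_score(rule, belief, r))

print(best_report(log_rule, 0.7))     # honest report maximizes the log rule
print(best_report(linear_rule, 0.7))  # linear rule rewards overstating
```

Under the linear rule the expected score keeps rising as more probability is
piled on the favored option, so the search stops at the edge of the grid; under
the logarithmic rule the maximum sits at the examinee's actual belief.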

Slakter, M.J. Risk taking on objective examinations. American Educational Re-
search Journal, 1967, 31-43.

A model of risk-taking on objective examinations under conventional di-
rections is included. Measures of risk-taking used in the past are re-
viewed, including Swineford's gambling tendency, the number of omitted
responses, and Coombs' type directions. A new measure of risk-taking is
proposed. Coombs' type directions are given, and a number of nonsense
questions are inserted into the test. An index is defined, based on the
number of alternatives in the nonsense questions which are crossed out.
A correlational study showed the new measures of risk-taking to be re-
liable. Some evidence for convergent and discriminant validity is of-
fered.

Slakter, M.J. Generality of risk-taking on objective examinations. Educational
and Psychological Measurement, 1969, 29, 115-128.

The question of whether risk-taking on objective tests is a general phenome-
non which applies to various kinds of testing situations is examined. The
measure of risk-taking involved imbedding nonsense questions in the test.
The generality of the risk-taking factor was supported by the correlations
between the risk-taking measures for four tests: mathematics, language,






Slovic, P. Convergent validation of risk-taking measures. Journal of Abnormal
and Social Psychology, 1962, 65, 68-7

The intercorrelations among several risk-taking measures of different
kinds were examined to determine whether they were high enough to pro-
vide support for convergent validity. The response set measures in-
cluded the Dot Estimation test, which reflected speed versus accuracy;
Word Meanings, which measured inclusiveness of category width; and Test
Risk, which used a variant of Coombs' type directions and accounted for
gambling set. Questionnaires used were the Life Experience Inventory and
the Job Preference Inventory. Experimental gambling measures were taken
with the Bet Preference and the Self-Crediting test, both of which investi-
gated variance preferences. Low intercorrelations (below .35) indicate
a lack of convergent validity.

Slovic, P. Assessment of risk-taking behavior. Psychological Bulletin, 1964,
61, 220-233.

The literature relevant to the validity of various risk-taking measures
is extensively reviewed. The studies are classified into three cate-
gories: response set and judgmental measures, questionnaire measures,
and probability and variance preference measures. The lack of agree-
ment in convergent validity might be due to the multidimensionality of
risk, the subjectivity involved in perceiving risk, or the emotional or
autonomic response necessary to arouse risk-taking tendencies. The
bibliography is very inclusive.

Stael von Holstein, C.-A.S. Measurement of subjective probability. Acta Psy-
chologica, 1970, 34, 146-159.

Scoring rules are discussed in a highly understandable manner. Proper
scoring rules and strictly proper scoring rules are defined. Criteria
for selecting one scoring rule over another are mentioned. These in-
clude Raiffa's principles of relevance, univariance and strong dis-
criminability. Roberts' Bayesian model for comparing probabilistic pre-
dictions is shown to invoke these three principles. A scoring rule is
developed that is sensitive to distance, or orderings of the possible
events. This rule conflicts with Raiffa's principles. Practical uses of
scoring rules as feedback devices are presently restricted to the areas
of meteorology and educational testing. Assessment techniques not based
on scoring rules are briefly reviewed, including Winkler's questionnaire
which uses four methods to elicit underlying distributions. Toda's
"range betting method" is mentioned.

Stanley, J.C. and Wang, M.D. Weighting test items and test-item options, an
overview of the analytical and empirical literature. Educational and
Psychological Measurement, 1970, 30, 21-35.

The literature encompassing differential weighting of items as well as
options is reviewed. Differential weighting of items with the same
weights for all examinees seems useless. However, two modifications
seem promising. Birnbaum differentially weighted items by the levels
of ability of the examinees, and Cleary developed a procedure for using
individual regression weights. Differential weighting of options was
originally developed to maximize the relationship of the instrument
with outside criteria. Guttman keyed each option against a quantitative
criterion using the criterion mean of those who chose that option as the
scoring weight. A cursory review of the personal probability weightings
of the options is presented, and the approach is recommended with modi-
fications.

Swineford, F. Measurement of a personality trait. Journal of Educational Psy-
chology, 1938, 29, 295-300.

The tendency to gamble, a personality trait affecting objective test
scores, is measured by incorporating an instruction into the testing si-
tuation whereby the student can claim from two to four points credit for
each item. The student is penalized by double the amount of credit
claimed if the wrong option is chosen. The gambling score is the per-
centage of errors marked "4" to the total number of errors plus one-half
of the omissions for a true-false test. The gambling score formula
yields a reliable measure of a trait which is independent of achievement
on the same test. The test should be difficult for this measure to be
reliable.
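
Read literally, the item scoring and the gambling score described in this
annotation can be sketched as below. The function names are hypothetical,
scoring an omitted item as zero is an assumption, and returning no score when
the denominator is zero is likewise a choice made for the sketch.

```python
def item_score(credit_claimed, is_correct):
    """Credit of 2 to 4 points if right; double that amount lost if wrong."""
    if credit_claimed not in (2, 3, 4):
        raise ValueError("credit claimed must be 2, 3, or 4")
    return credit_claimed if is_correct else -2 * credit_claimed

def gambling_score(errors_marked_4, total_errors, omissions):
    """Percentage of errors marked '4' relative to the total number of
    errors plus one-half of the omissions (true-false test)."""
    denominator = total_errors + 0.5 * omissions
    if denominator == 0:
        return None  # assumption: no score is defined in this case
    return 100.0 * errors_marked_4 / denominator

# Invented example: 10 errors, 6 of them marked "4", and 4 omitted items.
print(gambling_score(errors_marked_4=6, total_errors=10, omissions=4))
```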

Swineford, F. Analysis of a personality trait. Journal of Educational Psy-
chology, 1941, 32, 438-444.

The tendency to gamble was measured on four tests administered to the
same population. One fourth of the 457 students were eliminated from
consideration since on at least one test either no extra credits were
claimed or no errors were made. In either case no gambling score could
be computed. Boys exhibited a significantly higher tendency to gamble
than girls, especially on unfamiliar types of tests. More students
gambled on unfamiliar material. The gambling scores were in most cases
independent of five mental factors and correlated highly with each
other.

van Naerssen, R.F. A scale for the measurement of subjective probability. 
Acta Psychologica, 1962, 20, 159-166.

To avoid measuring subjective probability by the more cumbersome method
of paired comparisons, the subject or selector has to choose between a
number of ordered pairs at the same time. A type of scale is developed
with a quadratic solution. Applications are the measurement of subjec-
tive probabilities as in assessing level of aspiration or predicting suc-
cess or failure for a candidate and the increasing of the reliability of
two choice tests.

Winkler, R.L. The assessment of prior distributions in Bayesian analysis.
Journal of the American Statistical Association, 1967a, 62, 776-800.

University of Chicago students were questioned using four techniques:
Cumulative Distribution Function, Hypothetical Future Samples, Equiva-
lent Prior Sample Information, and Probability Density Function, in or-
der to elicit enough information to write down their prior distributions.
Subjects had trouble with CDFs but in general learned to assess prior
distributions on their own. A revised questionnaire is presented in the
appendix.

Winkler, R.L. The quantification of judgment: some experimental results. Pro-
ceedings of the American Statistical Association, March 1967b, pp. 386-
395.

The efficacy of scoring rules and bets in keeping assessors of subjec-
tive probabilities honest and providing them with feedback is investi-
gated experimentally. The 13 week study involved the weekly assessments
of various probabilities and the expected point spread of weekend football
games. The subjects were given feedback from two scoring rules, the qua-
dratic evaluating their probabilities and a squared-error loss evaluating
their spread. They were then given a chance to make bets on the basis of
their probability assignments. The scoring rules and bets seemed to lead
the assessors to make careful assessments. A consensus of assessors com-
pared favorably to the performance of the individuals comprising the con-
sensus.

Winkler, R.L. The quantification of judgment: some methodological suggestions.
Journal of the American Statistical Association, 1967c, 62, 1105-1120.

An ideal assessor of personal probability, who never violates the postu-
lates of coherence, is imagined to be faced with choices of bets. In
order to force true responses as to his personal probabilities, a pen-
alty or scoring function must encourage revelation of the probabilities.
Four proper scoring rules are described: de Finetti's rule, the "Brier
score," the spherical gain, and the logarithmic loss. The implications
and practicality of these methods are discussed.

Winkler, R.L. and Murphy, A.H. "Good" probability assessors. Journal of Applied
Meteorology, 1968, 7, 751-758.

A framework for evaluating meteorologists who assess probabilities must
be consistent with the theory of subjective probability. Two standards
of "goodness" are described: normative, which requires the assessor to obey
the postulates of coherence and make honest assessments, and substantive,
concerned with knowledge of the subject and reflected in the degree of
association between the predictions and the observations. Three proper
scoring rules are discussed: quadratic, spherical, and logarithmic.
The logarithmic scoring rule only considers the probability of the out-
come that occurs, while the other two are concerned with all the proba-
bilities. Proper scoring rules encourage assessors to be honest, permit
evaluation of assessors, and help individuals become better assessors.
Proper scoring rules may not yield consistent results, since they may
not assess the same aspects of the attribute validity. Rankings based
on average scores may be reasonably consistent.




Winkler, R.L. and Murphy, A.H. Nonlinear utility and the probability score.
Journal of Applied Meteorology, 1970, 9, 143-148.

Proper scoring rules assume that the assessor has a linear utility func-
tion. If the utility function is actually nonlinear, as in the cases of
a risk-taker and a risk-avoider, factors other than the expected score
may affect the probability forecasts. The expected utility is found to
depend on the variance of the score as well as the expected probability
score for the risk-taker. The optimal forecast for an extreme risk-taker
would be to assign the event probability one if the assessor's actual sub-
jective probability, p_j, were greater than one-half and zero if p_j were
less than one-half. A risk-avoider is presumed to prefer a small vari-
ance to a large one. An extreme risk-avoider would prefer probabilities
close to one-half. If the assessor's utility function can be specified,
it should be incorporated into the assessment process by defining a new
rule, a composite of the original rule and the utility function. If
the utility function cannot be determined, the assessor's statements may
differ from the true subjective probability judgments.
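
The effect of a nonlinear utility function on the optimal forecast can be
illustrated with a small numerical search. The sketch below is not the model
used in the article: the quadratic score is scaled to run from 0 to 1, the
risk-taker is given an arbitrarily chosen steeply convex exponential utility,
and all names are hypothetical.

```python
import math

def quadratic_score(stated, event_occurred):
    """Binary quadratic score, scaled to lie between 0 and 1."""
    miss = (1.0 - stated) if event_occurred else stated
    return 1.0 - miss ** 2

def expected_utility(stated, belief, utility):
    """Expected utility of stating `stated` when the true belief is `belief`."""
    return (belief * utility(quadratic_score(stated, True))
            + (1.0 - belief) * utility(quadratic_score(stated, False)))

def best_forecast(belief, utility, steps=100):
    """Search stated probabilities 0, 0.01, ..., 1 for the highest utility."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda r: expected_utility(r, belief, utility))

def linear(score):
    return score

def risk_taker(score):
    # Assumed utility: steeply convex, standing in for an extreme risk-taker.
    return math.exp(10.0 * score)

print(best_forecast(0.6, linear))      # honest forecast under linear utility
print(best_forecast(0.6, risk_taker))  # pushed to one when belief exceeds 1/2
print(best_forecast(0.4, risk_taker))  # pushed to zero when belief is below 1/2
```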

Ziller, R.C. A measure of the gambling response-set in objective tests. Psy-
chometrika, 1957, 22, 289-292.


A formula for measuring risk-taking or gambling set in objective tests
is developed. The index of risk-acceptance depends on the number of
alternatives, the number of incorrect responses, and the number of omis-
sions. The index is designed for tests in which examinees are informed
that a correction for guessing will be applied. A few implications of
this measure for test theory and construction are discussed.






